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Abstract 

In this paper we explore avenues for improving the reliability 
of dimensionality reduction methods such as Non-Negative 
Matrix Factorization (NMF) as interpretive exploratory data 
analysis tools. We first explore the difficulties of the op- 
timization problem underlying NMF, showing for the first 
time that non-trivial NMF solutions always exist and that 
the optimization problem is actually convex, by using the 
theory of Completely Positive Factorization. We subse- 
quently explore four novel approaches to finding globally- 
optimal NMF solutions using various ideas from convex op- 
timization. We then develop a new method, isometric NMF 
(isoNMF), which preserves non- negativity while also provid- 
ing an isometric embedding, simultaneously achieving two 
properties which are helpful for interpretation. Though it 
results in a more difficult optimization problem, we show ex- 
perimentally that the resulting method is scalable and even 
achieves more compact spectra than standard NMF. 

1 Introduction. 

In this paper we explore avenues for improving the relia- 
bihty of dimensionality reduction methods such as Non- 
Negative Matrix Factorization (NMF) [21] as interpre- 
tive exploratory data analysis tools, to make them reli- 
able enough for, say, making scientific conclusions from 
astronomical data. 

NMF is a dimensionality reduction method of much 
recent interest which can, for some common kinds of 
data, sometimes yield results which are more meaning- 
ful than those returned by the classical method of Prin- 
cipal Component Analysis (PC A), for example (though 
it will not in general yield better dimensionality reduc- 
tion than PCA, as we'll illustrate later). For data of 
significant interest such as images (pixel intensities) or 
text (presence/absence of words) or astronomical spec- 
tra (magnitude in various frequencies), where the data 
values are non-negative, NMF can produce components 
which can themselves be interpreted as objects of the 
same type as the data which are added together to pro- 
duce the observed data. In other words, the components 
are more likely to be sensible images or documents or 
spectra. This makes it a potentially very useful inter- 
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pretive data mining tool for such data. 

A second important interpretive usage of dimen- 
sionality reduction methods is the plot of the data points 
in the low-dimensional space obtained (2-D or 3-D, 
generally). Multidimensional scaling methods and re- 
cent nonlinear manifold learning methods focus on this 
usage, typically enforcing that the distances between 
the points in the original high-dimensional space are 
preserved in the low-dimensional space (isometry con- 
straints). Then, apparent relationships in the low-D 
plot (indicating for example cluster structure or out- 
liers) correspond to actual relationships. A plot of the 
points using components found by standard NMF meth- 
ods will in general produce misleading results in this 
regard, as existing methods do not enforce such a con- 
straint. 

Another major reason that NMF might not yield 
reliable interpretive results is that current optimization 
methods [TSl [52] might not find the actual optimum, 
leading to poor performance in terms of both of the 
above interpretive usages. This is because its objective 
function is not convex, and so unconstrained optimizers 
are used. Thus, obtaining a reliably interpretable 
NMF method requires understanding its optimization 
problem more deeply - especially if we are going to 
actually create an additionally difficult optimization 
problem by adding isometry constraints. 

1.1 Paper Organization. In Section [2] we first 
study at a fundamental level the optimization problem 
of standard NMF. We relate for the first time the NMF 
problem to the theory of Completely Positive Factoriza- 
tion, then using that theory, we show that every non- 
negative matrix has a non-trivial exact non-negative 
matrix factorization of the form W=VH, a basic fact 
which had not been shown until now. Using this the- 
ory we also show that a convex formulation of the NMF 
optimization problem exists, though a practical solu- 
tion method for this formulation does not yet exist. 
We then explore four novel formulations of the NMF 
optimization problem toward achieving a global opti- 
mum: convex relaxation using the positive semidefinite 
cone, approximating the semidefinite cone with smaller 
ones, convex multi-objective optimization, and general- 
ized geometric programming. We highlight the difficul- 



ties encountered by each approach. 

In order to turn to the question of how to create a 
new isometric NMF, in Section [3] we give background on 
two recent successful manifold learning methods, Maxi- 
mum Variance Unfolding (MVU) [S^ and a new variant 
of it, Maximum Furthest Neighbor Unfolding (MFNU) 
pS] , It has been shown experimentally [M] that they 
can recover the intrinsic dimension of a dataset very re- 
liably and effectively, compared to previous well-known 
methods such as ISOMAP [21], Laplacian Eigen-Maps 
[2] and Diffusion Maps [6' . In synthetic experiments the 
above methods manage to decompose data into mean- 
ingful dimensions. For example, for a dataset consisting 
of images of a statue photographed from different hor- 
izontal and vertical angles, MVU and MFNU find two 
dimensions that can be identified as the horizontal and 
the vertical camera angle. MVU and MFNU contain 
ideas, particularly concerning the formulation of their 
optimization problems, upon which isometric NMF will 
be based. 

In Section [5] we show a practical algorithm for an 
isometric NMF {isoNMF for short), representing a new 
data mining method capable of producing both inter- 
pretable components and interpretable plots simultane- 
ously. We use ideas for efficient optimization and effi- 
cient neighborhood computation to obtain a practical 
scalable method. 

In Section [5] we demonstrate the utility of isoNMF 
in experiments with datasets used in previous papers. 
We show that the components it finds are comparable 
to those found by standard NMF, while it additionally 
preserves distances much better, and also results in 
more compact spectra. 



2 Convexity in 
Factorization. 



Non Negative Matrix 



Given a non- negative matrix V e the goal of 

NMF is to decompose it in two matrices W G SR^^^'', 
H e SR*^""" such that V = WH. Such a factorization 
always exists for k > m. The factorization has a trivial 
solution where W — V and H = /,„. Determining 
the minimum A; is a difficult problem and no algorithm 
exists for finding it. In general we can show that NMF 
can be cast as a Completely Positive (CP) Factorization 
problem [3]. 

Definition 2.1. A matrix A e 3?^^^ is Completely 
Positive if it can be factored in the form A — BB"'" , 
where 5 6 3?^'"=. The minimum k for which A — BB 
holds is called the CP rank of A. 

Not all matrices admit a completely positive factoriza- 
tion even if they are positive definite and non-negative. 
Notice though that for every positive definite non- 



negative matrix a Cholesky factorization always exists, 
but there is no guarantee that the Cholesky factors are 
non-negative too. Up to now there is no algorithm of 
polynomial complexity that can decide if a given posi- 
tive matrix is CP. A simple observation can show that 
A has to be positive definite, but this is a necessary and 
not a sufficient condition. 

Theorem 2.1. If A e 3?+''^ is CP then rankiA) < 
cp-rank(A) < ^^^^ - 1 

The proof can be found in [3]p.l56. It is also conjectured 
that the upper bound can be tighter 



Theorem 2.2. if A e ^ 
then it is also CP. 



NxN 



diagonally dominant, 



The proof of the theorem can be found in [T7] . Although 
CP factorization {A — BB^) doesn't exist for every 
matrix, we prove that non-trivial NMF {A — WH) 
always exists. 

Theorem 2.3. Every non-negative matrix V G 3?^^'" 
has a non-trivial, non-negative factorization of the form 
V = WH. 



Proof. Consider the following matrix: 
(2.1) Z = 



D V 

E 



We want to prove that there always exists B G ^^^^ 



such that Z ^ BB'^. 
form: 

(2.2) 



If this is true then B can take the 



B = 



W 
H^ 



Notice that if D and E are arbitrary diagonally domi- 
nant completely positive matrices, then B always exists. 
The simplest choice would be to chose them as diagonal 
matrices where each element is greater or equal to the 
sum of rows/columns of V. Since they are diagonally 
dominant according to 12.21 Z is always CP. Since Z is 
CP then B exists so do W and H. □ 



Although theorem 12.21 also provides an algorithm for 
constructing the CP-factorization, the cp-rank is usually 
high. A corollary of theorems 12.11 (cp-rank(^) > 
rank(^)) and [Ql (existence of NMF) is that SVD has 
always a more compact spectrum than NMF. 

There is no algorithm known yet for computing an 
exact NMF despite its existence. In practice, scientists 
try to minimize the norm [121 122] of the factorization 
error. 

(2.3) mmllV -WH\\2 

W,H 

'^A matrix is diagonally dominant if an > ^j^i 



This is the objective function we use in the experiments 
for this paper. 

2.1 Solving the optimization problem of NMF. 

Although in the current Uterature it is widely believed 
that NMF is a non-convex problem and only local min- 
ima can be found, we will show in the following subsec- 
tions that a convex formulation does exist. Despite the 
existence of the convex formulation, we also show that 
a formulation of the problem as a generalized geomet- 
ric program, which is non-convex, could give a better 
approach for finding the global optimum. 

2.1.1 NMF as a convex conic program. 

Theorem 2.4. The set of Completely Positive Matri- 
ces IS a convex cone. 

Proof. See [3] p. 71. 

It is always desirable to find the minimum rank of NMF 
since we are looking for the most compact representa- 
tion of the data matrix V . Finding the minimum rank 
NMF can be cast as the following optimization problem: 



(2.4) 



min rank 

W.H 



W V 

n 

subject to 



(2.5) 



2.2 Convex relaxations of the NMF problem. 

In the following subsections we investigate convex relax- 
ations of the NMF problem with the Positive Semidefi- 
nite Cone [53]. 

2.2.1 A simple convex upper bound with Sin- 
gular Value Decomposition. Singular Value Decom- 
position (SVD) can decompose a matrix in two factors 
U,V: 

(2.9) A = UV 

Unfortunately the sign of the SVD components of A > 
cannot be guaranteed to be non-negative except for the 
first eigenvector [24]. However if we project U,V on 
the nonnegative orthant {U, > 0) we get a very good 
estimate (upper bound) for NMF. We will call it clipped 
SVD, (CSVD). CSVD was used as a benchmark for the 
relaxations that follow. It has also been used as an 
initializer for NMF algorithms ,20J. 

2.2.2 Relaxation with a positive semidefinite 
cone. In the minimization problem of eq. 12.2.21 where 
the cost function is the L2 norm, the nonlinear terms 
Wiihij appear. A typical way to get these terms [26j 
would be to generate a large vector z = [W' {:); H{: 
)], where we use the MATLAB notation {H{:) is the 
column- wise unfolding of a matrix). li Z = zz^ 
(rank{Z) — 1) and z > is true, then the terms 
appearing in 1|V^ — H^i?||2 are linear in Z. In the 
following example eq. I2.10[ 12.111 (see next page) where 
V G sR2x3^ ^ g SR2X2 g sfj2x3 ^^low the structure 

of Z. Terms in bold are the ones we need to express the 
constraint V = WH, i.e vn — wuhu + 1012/121- 



Since minimizing the rank is non-convex, we can use its 
convex envelope that according to [28 is the trace of the 
matrix. So a convex relaxation of the above problem is: 



(2.6) 
(2.7) 

(2.8) 



mm 



Trace( 



V 

n 

subject to: 



(2.10) 



Wii 

W12 
W21 
W22 
hii 

h2i 
hi2 

h22 

hi3 

h23 



Now the optimization problem eq. is equivalent to: 



After determining W, Ti, W and H can be recovered 
by CP factorization of W, 7i, which again is not an 
easy problem. In fact there is no practical barrier 
function known yet for the CP cone so that Interior 
Point Methods can be employed. Finding a practical 
description of the CP cone is an open problem. So 
although the problem is convex, there is no algorithm 
known for solving it. 



(2.12) min 



=N,j= 

E 

1=1,3 = 1 



! fe 

1=1 



Hk+l,Nk+jk+l 



subject to: 
rank(Z) = 1 

This is not a convex problem but it can be easily 
be relaxed to [3: 



(2.11) Z = 



r 2 


11)1111)12 


W11W21 


11)1111)22 


wiihii 


Wllh21 


Wllhl2 


Wllh22 


wiihia 


Wllh23 






11)1211)21 


W12W22 


Wi2hii 


Wl2h21 


1Dl2hl2 


Wl2h22 


11)l2hi3 


Wl2h23 




W21W12 


■W21 


11)2111)22 


W2lhii 


^"21^21 


W2lhi2 


1D2lh22 


W2lhi3 


W2lh23 


U'22Wll 


W22W12 


W22W21 


w'22 


W22/111 


W22h21 


1I)22hl2 


W22h22 


1022/113 


W22h23 


ftlllUll 


fellTOl2 


hli11)21 


hllW22 


hll 


hiih2i 


hiihi2 


hiih22 


hiihi3 


hiih23 




h2l11)i2 


h2l1D21 


h2l1D22 


h2ihii 


hli 


h2ihi2 


^21^22 


h2ihi3 


h2lh23 


hi2Wii 


hi2ll)i2 


hl2lD21 


^12^22 


hi2hii 


hi2h2i 


hl2 


^12^22 


hi2hi3 


hl2h23 


h22Wii 


h22ll)i2 


h22lD21 


h22lD22 


h22hii 


/!-22^21 


h.22hl2 


hl2 


h22hi3 


h22h.23 


hl3Wll 


hl3W12 


hls1D21 


hl3lD22 


hnhii 


hish21 


hi3hi2 


hl3h22 


his 


hish23 


/123W11 


h23ll)i2 


^23^21 


h23W22 


h2zhii 


h2ih2l 


h23hl2 


h23h22 


h23hl3 





(2.13) 



min Tracc(Z) 
subject to: 
A»Z 

Z 
Z 
Z 



h 
> 



where A is a matrix that selects the appropriate ele- 
ments from Z . Here is an example for a matrix A that 
selects the elements of Z that should sum to the V13 
element: 



(2.14) = 







1 

1 








In the second formulation (|2.13p we have relaxed 
Z = zz^ with Z > zz^ . The objective function tries to 
minimize the rank of the matrix, while the constraints 
try to match the values of the given matrix V . After 
solving the optimization problem the solution can be 
found on the first eigenvector of Z . The quality of the 
relaxation depends on the ratio of the first eigenvalue 
to sum of the rest. The positivity of Z will guarantee 
that the first eigenvector will have elements with the 
same sign according to the Peron Frobcnious Theorem 
[24] . Ideally if the rest of the eigenvectors are positive 
they can also be included. One of the problems of this 
method is the complexity. Z is {N+ni)k x {N+m)k and 
there are non- negative constraints. 

Very quickly the problem becomes unsolvable. 

In practice the problem as posed in l2.12l alwavs gives 
W and H matrices that are rank one. After testing 
the method exhaustively with random matrices V that 
either had a product V = WH representation or not the 
solution was always rank one on both W and H . This 
was always the case with any of the convex formulations 
presented in this paper. This is because there is a 



missing constraint that will let the energy of the dot 
products spread among dimensions. This is something 
that should characterize the spectrum of H . 

The H matrix is often interpreted as the basis vec- 
tors of the factorization and W as the matrix that has 
the coefficients. It is widely known that in nature spec- 
tral analysis is giving spectrum that decays either expo- 
nentially e^^^ or more slowly 1//'''. Depending on the 
problem we can try different spectral functions. In our 
experiments we chose the exponential one. Of course 
the decay parameter A is something that should be set 
adhoc. We experimented with several values of A, but 
we couldn't come up with a systematic, heuristic and 
practical rule. In some cases the reconstruction error 
was low but in some others not. Another relaxation that 
was necessary for making the optimization tractable was 
to reduce the the non-negativity constraints only on the 
elements that are involved in the equality constraints. 

2.2.3 Approximating the SDP cone with 
smaller ones. A different way to deal with the compu- 
tational complexity of SDP (eq. I2.13P is to approximate 
the big SDP cone {N + m)k x {N + m)k with smaller 
ones. Let Wi be the ith row of W and Hj the jth column 
of H. Now Zij = [Wi{:y;Hj{:)] (2k dimensional vector) 



and Zi. 



(2.15) 



{2k X 2k matrix), or 



Zi. 



or it is better to think it in the form: 



(2.16) 



W, ZwH 

ZwH 'Hj 



and once yV,7i are found then Wi,Hj can be found 
from SVD decomposition of W,H and the quality of 
the relaxation will be judged upon the magnitude of 
the first eigenvalue compared to the sum of the others. 
Now the optimization problem becomes: 

N m 



(2.17) min^^Trace(Zy) 



.1 j=i 



Z,, > 



An immediate consequence is that 



y 



Aij • Ztj = Vij, Vi, j 

The above method has Nm constraints. In terms of 
storage it needs 

• (N + m) symmetric positive definite k x k ma- 
trices for every row/column of W, H, which is 

{N+m)k{k+l) 
2 



• Nm symmetric positive definite k x k matrices for 

(jVm)fc(fc+l) 
2 



every WiHj product, which is 



In total the storage complexity is 0{{N + m + 
Nm) ) which is significantly smaller by an order 

of magnitude from 0{ ^ which is the 

complexity of the previous method. There is also sig- 
nificant improvement in the computational part. The 
SDP problem is solved with interior point methods ^26j 
that require the inversion of a symmetric positive def- 
inite matrix at some point. In the previous method 
that would require 0{{N + m)^k^) steps, while with 
this method we have to invert Nm 2k x 2k matrices, 
that would cost Nm{2k)^. Because of their special 
structure the actual cost is {Nm)k^ ~\- ma.x{N,m)k^ — 
{Nm + max(iV, m))k^ . 

We know that >V^,7^J ^ 0. Since is PSD and 
according to Schur's complement on eq. 12.161 



(2.18) 



WH 



>- 



So instead of inverting (|2.16p that would cost 8k^ we 
can invert 12.181 This formulation gives similar results 
with the big SDP cone and most of the cases the results 
are comparable to the CSVD. 

2.2.4 NMF as a convex multi-objective prob- 
lem. A different approach would be to find a convex 
set in which the solution of the NMF lives and search 
for it over there. Assume that we want to match 
Vij = WiHj = WiiHij. In this section we show 

that by controlling the ratio of the L2/L1 norms of 
W,H it is possible to find the solution to NMF. De- 
fine WiiHij = Vij^i and Ya=i ^j,' = • We form the 
following matrix that we require to be PSD: 



(2.19) 



1 


Wu 


Hi, 


Wa 


til 


Vr,,l 




Vr,,l 


t,l 



>- 



If we use the Schur complement we have: 



(2.20) 



WuHi, 



WaHij 



>- 



(2.21) ta 

(2.22) tji 
{2.23){tu-Wi){t,i-Hf^) 



> 
> 
> 



Wi 

m 



In the above inequality we see that the L2 error 
Y^iLi LjLi J2'i=iiVij,i-WuHij)'^ becomes zeros iitu = 
Wfi^tji = H^i, \/tii,tji. In general we want to min- 
imize t while maximizing and ||-ff|p. L2 norm 
maximization is not convex, but instead we can max- 
imize Hij which are equal to the Li norms 
since everything is positive. This can be cast as convex 
multi-objective problem on the second order cone |4]. 



i=i 1^1=1 ''ii + ^1=1 '■'i 

, = 1 1^1 = 1^1, 



(2.24) 



subject to : 

ta - Wi V,,, - 
Vij,i — WiiHij tji 



WuHi, 



>- 



Unfortunately multi-objective optimization problems, 
even when they are convex, they have local minima that 
are not global. An interesting direction would be to test 
the robustness of existing multi-objective algorithms on 
NMF. 

2.2.5 Local solution of the non-convex problem. 

In the previous sections we gave several convex formu- 
lations and relaxations of the NMF problem that unfor- 
tunately are either unsolvable or they give trivial rank 
one solutions that are not useful at all. 

In practice the non-convex formulation of eq. 12.2.21 
(classic NMF objective) along with other like the KL 
distance between V and WH are used in practice [22]. 
All of them are non-convex and several methods have 
been recommended, such as alternating least squares, 
gradient decent or active set methods [18]. In our 
experiments we used the L-BFGS method that scales 
very well for large matrices. 

2.2.6 NMF as a Generalized Geometric Pro- 
gram and it's Global Optimum. The objective 
function feq. l2.2.2p can be written in the following form: 

N m k 

{2.25) \\V - WH\\2 = EEE - WuHi,)' 
i=i j=i 1=1 

The above function is twice differentiable so according 
to [TT] the function can be cast as the difference 
of convex (d.c.) functions. The problem can be 



also known as vector optimization 



solved with general off-the-shelf global optimization 
algorithms. It can also be formulated as a special 
case of dc programming, the generalized geometric 
programming. With the following transformation Wu — 
e™*' , Hij — e'''J the objective becomes: 

N m k 2 

(2.26) \\V-WH\\2 = J2Y.J2 {^^^ - e^'-'^^'^) 
1=1 j=i 1=1 

N m N m / k 

i=i j=i i=i j=i \i=i J 

N m / k \ 

-2EE^^E^"""'" 

i=i j=i \;=i / 

The first term is constant and it can be ignored for the 
optimization. The other two terms: 

N m / k \ ^ 

{2.21)f{wuM,) = EE E^''"^'"M 
i=l j=l \l=l ) 

N m / k \ 

i=l j=l \l=l I 

are convex functions also known as the exponential form 
of posynomials [5- For the global solution of the 
problem 

(2.29) min j{W ,H) ~ giy^ ,H) 

W,H 

the algorithm proposed in [8J can be employed. The 
above algorithm uses a branch and bound scheme that is 
impractical for high dimensional optimization problems 
as it requires too many iterations to converge. It is 
worthwhile though to compare it with thelocal non- 
convex NMF solver on a small matrix. We tried to do 
NMF of order 2 on the following random matrix: 

" 0.45 0.434 0.35 " 
0.70 0.64 0.43 
0.22 0.01 0.3 

After 10000 restarts of the local solver the best error 
we got was 2.7% while the global optimizer very quickly 
gave 0.015% error, which is 2 orders of magnitude less 
than the local optimizer. 

Another direction that is not investigated in this 
paper is the recently developed algorithm for Difference 
Convex problems by Tao [30] that has been applied 
successfully to other data mining applications such as 
Multidimensional Scaling. [1]. 



3 Isometric Embedding 

The key concept in Manifold Learning (ML)is to repre- 
sent a dataset in a lower dimensional space by preserving 
the local distances. The differences between methods 
Isomap [31 , Maximum Variance unfolding (MVU) [33] , 
Laplacian EigenMaps |2j and Diffusion Maps [6] is how 
they treat distances between points that are not in the 
local neighborhood. For example IsoMap preserves ex- 
actly the geodesic distances, while Diffusion Maps pre- 
serves distances that are based on the diffusion kernel. 
Maximum Furthest Neighbor Unfolding (MFNU) [25] 
that is a variant of Maximum Variance Unfolding, pre- 
serves local distance and it tries to maximize the dis- 
tance between furthest neighbors. In this section we 
are going to present the MFNU method as it will be the 
basis for building isoNMF. 

3.1 Convex Maximum Furthest Neighbor Un- 
folding. Weinberger formulated the problem of isomet- 
ric unfolding as a Semidefinite Programming algorithm 
|33j^ . In [T9| Kulis presented a non-convex formula- 
tion of the problem that requires less memory than the 
Semidefinite one. He also claimed that the non-convex 
formulation is scalable. The non-convex formulation has 
the same global optimum with the Semidefinite one as 
proven in ^5>. In ■25j Vasiloglou presented experiments 
where he verified the scalability of this formulation. A 
variant of MVU the Maximum Furthest Neighbor Un- 
folding (MFNU) was also presented in the same paper. 
The latest formulation tends to be more robust and scal- 
able than MVU, this is why we will employ it as the basis 
of isoNMF. Both methods can be cast as a semidefinite 
programming problem '32]. 

Given a set of data X e K^^'*, where N is the 
number of points and d is the dimensionality, the dot 
product or Gram matrix is defined as G = XX^ . 
The goal is to find a new Gram matrix K such that 
rank{K) < rank{G) in other words K = XX^ 
where X € and d' < d. Now the dataset 

is represented by X which has fewer dimensions than 
X. The requirement of isometric unfolding is that the 
euclidian distances in the SR'* for a given neighborhood 
around every point have to be the same as in the di'^. 
This is expressed in: 

Kii + Kjj — Kij — Kji — Gii + Gjj ^ Gij — Gji j G li 

where li is the set of the indices of the neighbors of 
the ith point. From all the K matrices MFNU chooses 
the one that maximizes the distances between furthest 



^Posynomial is a product of positive variables exponentiated 
in any real number 



*A similar approach to learning metrics is given in )29| 



neighbor pairs. So the algorithm is presented as an SDP: 



N 



N 



max > Bi • K 

K ^ 

1=1 

subject to 
A^j • K 



K > Q 



where the A • X — Tia,ce{AX^) is the dot product 
between matrices. Aij has the following form: 



(3.30) 



1 

'■• 

: 



-1 







-1 ... 

.. 

. 

. 







Gii + Gjj — Gij — Gj 



and 

(3.31) dij — ^ It I ^ 3^ 

Bi has the same structure of Aij and computes the 
distance of the ith point with its furthest neighbor. 
The last condition is just a centering constraint for 
the covariance matrix. The new dimensions X are 
the eigenvectors oi K. In general MFNU gives Gram 
matrices that have compact spectrum, at least more 
compact than traditional linear Principal Component 
Analysis (PC A). The method behaves equally well with 
MVU. Both MVU and MFNU are convex so they 
converge to the global optimum. Unfortunately this 
method can handle datasets of no more than hundreds 
of points because of its complexity. In the following 
section a non-convex formulation of the problem that 
scales better is presented. 

3.2 The Not! Convex Maximum Furthest 
Neighbor Unfolding. By replacing the constraint 
iiT ^ [5j with an explicit rank constraint K — RR^ 
the problem becomes non-convex and it is reformulated 
to 



N 



(3.32) 



max Bi • RR^ 

i=l 

subject to: 
Aij • RR^ = 



In [S] , Burer proved that the above formulation has the 
same global minimum with the convex one. 

The above problem can be solved with the aug- 
mented Lagrangian method [27] . 
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(3.33) C = ~^B,»RR^ 
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Our goal is to minimize the Lagrangian that's why the 
objective function is —RRF and not RRF 

The derivative of the augmented Lagrangian is: 



(3.34) 
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Gradient descent is a possible way to solve the mini- 
mization of the Lagrangian, but it is rather slow. The 
Newton method is also prohibitive. The Hessian of this 
problem is a sparse matrix although the cost of the 
inversion might be high it is worth investigating. In 
our experiments we used the limited memory BFGS (L- 
BFGS) method [23l |27] that is known to give a good 
rate for convergence. MFNU in this non-convex formu- 
lation behaves much better than MVU. In the experi- 
ments presented in [5S] , MFNU tends to find more often 
the global optimum, than MVU. The experiments also 
showed that the method scales well up to lOOK points. 

4 Isometric NMF. 

NMF and MFNU are optimization problems. The goal 
of isoNMF is to combine these optimization problems 
in one optimization problem. MFNU has a convex and 
a non-convex formulation, while for NMF only a non- 
convex formulation that can be solved is known. 

4.1 Convex isoNMF. By using the theory pre- 
sented in section [2. 1.1 1 we can cast isoNMF as a convex 
problem: 

N 



(4.35) 



max 

W,H 



Z 



subject to: 
A,j mW ^, 

W 

W e Kf^ 



V 

H 



Then W, H can be found by the completely positive 
factorization of W = WW^,H = iJiJ^. Again this 
problem although it is convex, there is no polynomial 
algorithm known for solving it. 

4.2 Non-convex formulation of isoNMF. The 

non convex isoNMF can be cast as the following prob- 
lem: 

N 

(4.36) max ^B^^WW'^ 

i=l 

subject to: 

WH = V 

W>Q 

H>0 

The augmented lagrangian with quadratic penalty func- 
tion is the following: 
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The non-negativity constraints are missing from the 
Lagrangian. This is because we can enforce them 
through the limited bound BFGS also known as L- 
BFGS-B. The derivative of the augmented Lagrangian 
is: 

ir ^ 
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4.3 Computing the local neighborhoods. As al- 
ready discussed in previous section MENU and isoNMF 
require the computation of all-nearest and all-furthest 
neighbors. The all-nearest neighbor problem is a special 
case of a more general class of problems called N-body 
problems [10]. In the following sections we give a sort 
description of the nearest neighbor computation. The 
actual algorithm is a four-way recursion. More details 
can be found in TOj . 

4.3.1 Kd-tree. The kd-tree is a hierarchical parti- 
tioning structure for fast nearest neighbor search [9]. 
Every node is recursively partitioned in two nodes un- 
til the points contained are less than a fixed number. 
This is a leaf. Nearest neighbor search is based on a top 
down recursion until the query point finds the closest 
leaf. When the recursion hits a leaf then it searches lo- 
cally for a candidate nearest neighbor. At this point we 
have an upper bound for the nearest neighbor distance, 
meaning that the true neighbor will be at most as far 
away as the candidate one. As the recursion backtracks 
it eliminates (prunes) nodes that there are further away 
than the candidate neighbor. Kd-trees provide on the 
average nearest neighbor search in O(logA^) time, al- 
though for pathological cases the kd-tree performance 
can asymptotically have linear complexity like the naive 
method. 

4.3.2 The Dual Tree Algorithm for nearest 
neighbor computation. In the single tree algorithm 
the reference points are ordered on a kd-tree. Every 
nearest neighbor computation requires 0(log(A'^)) com- 
putations. Since there are N query points the total 
cost is 0(iVlog(A^)). The dual-tree algorithm [10] or- 
ders the query points on a tree too. If the query set 
and the reference set are the same then they can share 
the same tree. Instead of querying a single point at a 
time the dual-tree algorithm always queries a group of 
points that live in the same node. So instead of do- 
ing the top-down recursion individually for every point 
it does it for the whole group at once. Moreover in- 
stead of computing distances between points and nodes 
it computes distances between nodes. This is the rea- 



son why most of the times the dual-tree algorithm can 
prune larger portions of the tree than the single tree 
algorithm. The complexity of the dual-tree algorithm 
is empirically 0{N). If the dataset is pathological then 
the algorithm can be of quadratic complexity too. The 
pseudo-code for the algorithm is described in fig. [TJ 



recurse(q : KdTree, r : KdTree) { 

if (inax_nearest_neighbor_distaiice_in_node (q) 
< distance(q, r) { 
/* prune */ 

} else if (IsLeaf (q)==true and IsLeaf (r)==true) { 
/* search for every point in q node */ 
/* its nearest neighbor in the r node */ 
/* at leaves we must resort to */ 
/* exhaustive search 0(n"2) */ 
/*update the iiiaxiiiiuin_neighbor_distance_in_node (q) */ 

} else if (IsLeaf (q)==false and IsLeaf (r)=true { 
/* choose the child that is closer to r */ 
/* and recurse first */ 
recurse(closest(r, q.left, q.right) , r) 
recurse (furthest (r, q.left, q.right), r) 

} else if (IsLeaf (q)==true and IsLeaf (r)==false) { 
/* choose the child that is closer to q */ 
/* and recurse first */ 
recurse(q, closest(q, r.left, r. right)) 
recurse (q, furthest (q, r.left, r. right)) 

} else { 

recurse (q. left , closest (q. left , r.left, r. right)); 
recurse (q. left , furthest (q. left , r.left, r. right)); 
recurse (q. right , closest (q. right , r.left, r. right)); 
recurse (q.right, furthest (q.right, r.left, r. right)); 

} 

} 



Figure 1: Pseudo-code for the dual-tree all nearest 
neighbor algorithm 



4.3.3 The Dual Tree Algorithm for all furthest 
neighbor algorithm. Computing the furthest neigh- 
bor with the naive computation is also of quadratic com- 
plexity. The use of trees can speed up the computations 
too. It turns out that furthest neighbor search for a sin- 
gle query point is very similar to the nearest neighbor 
search presented in the original paper of kd-tree jQj . The 
only difference is that in the top-down recursion the al- 
gorithm always chooses the furthest node. Similarly in 
the bottom up recursion we prune a node only if the 
maximum distance between the point and the node is 
smaller than the current furthest distance. The pseudo 
code is presented in fig. [2l 

5 Experimental Results 

In order to evaluate and compare the performance of 
isoNMF with traditional NMF we picked 3 benchmark 
datasets that have been tested in the literature: 



recurse (q : KdTree, r : KdTree) { 

if (inin_f urthest_neighbor_distance_in_node (q) 
< distance(q, r) { 
/* prune */ 

} else if (IsLeaf (q)==true and IsLeaf (r)==true) { 
/* search for every point in q node its 
/* furthest neighbor in the r node */ 
/* at leaves we must resort to */ 
/* exhaustive search 0(n"2) */ 

/*update the minimum_f urthest_distance_in_node (q) */ 
} else if (IsLeaf (q)==false and IsLeaf (r)=true { 
/*choose the child that is furthest to r */ 
/* and recurse first */ 

recurse (furthest (r, q.left, q.right), r) 
recurse(closest(r, q.left, q.right), r) 
} else if (IsLeaf (q)==true and IsLeaf (r)==false) { 
/* choose the child that is furthest to q */ 
/* and recurse first */ 

recurse (q, furthest (q, r.left, r. right)) 
recurse(q, closest(q, r.left, r. right)) 
} else { 

recurse (q. left , furthest (q. left , r.left, r. right)); 
recurse (q. left , closest (q. left , r.left, r. right)); 
recurse (q. right , furthest (q. right , r.left, r. right)); 
recurse (q. right , closest (q.right , r.left, r. right)); 

} 

} 



Figure 2: Pseudo-code for the dual-tree all furthest 
neighbor algorithm 

1. The CBCL faces database fig. |31[a,b) [13j, used in 
the experiments of the original paper on NMF [21] . 
It consists of 2429 grayscale 19 x 19 images that 
they are hand aligned. The dataset was normalized 
as in [21]. 

2. The isomap statue dataset fig. [S^c) [14 consists 
of 698 64 X 64 synthetic face photographed from 
different angles. The data was downsampled to 
32 X 32 with the Matlab imresize function (bicubic 
interpolation) . 

3. The ORL faces [H] fig.^d) presented in [12]. The 
set consists of 472 19 x 19 gray scale images that 
are not aligned. For visualization of the results we 
used the nmfpack code available on the web [16j . 

The resuhs for classic NMF and isoNMF with k- 
neighborhood equal to 3 are presented in fig. [4] and 
tables [U [21 We observe that classic NMF gives always 
lower reconstruction error rates that are not that far 
away from the isoNMF. Classic NMF fails to preserve 
distances contrary to isoNMF that always does a good 
job in preserving distances. Another observation is 
that isoNMF gives more sparse solution than classic 
NMF. The only case where NMF has a big difference 
in reconstruction error is in the CBCL-face database 



when it is being preprocessed. This is mainly because 
the preprocessing distorts the images and spoils the 
manifold structure. If we don't do the preprocessing 
fig. mf), the reconstruction error of NMF and isoNMF 
are almost the same. We would also like to point 
that isoNMF scales equally well with the classic NMF. 
Moreover they are seem to show the same sensitivity to 
the initial conditions. 

In fig. [S] we see a comparison of the energy spec- 
trums of classic NMF and isoNMF. We define the spec- 
trum as 



classic NMF 


cbcl norm. 


cbcl 


statue 


orl 


rec. error 
sparsity 
dist. error 


22.01% 
63.23% 
92.10% 


9.20% 
29.06% 
98.61% 


13.62% 
48.36% 
97.30% 


8.46% 
46.80% 
90.79% 



Table 1: Classic NMF, the relative root mean square 
error, sparsity and distance error for the four different 
datasets (cbcl normalized and plain, statue and orl) 



isoNMF 


cbcl norm. 


cbcl 


statue 


orl 


rec error 
sparsity 
dist. error 


33.34% 
77.69% 
4.19% 


10.16% 
43.98% 
3.07% 


16.81% 
53.84% 
0.03% 


11.77% 
54.86% 
0.01% 



This represents the energy of the component normalized 
by the energy of the prototype image generated by 
NMF/isoNMF. Although the results show that isoNMF 
is much more compact than NMF, it is not a reasonable 
metric. This is because the prototypes (rows of the 
H matrix are not orthogonal to each other. So in 
reality EiLi < Eili and actually 

much smaller. This is because the dot product between 
the rows is not zero. 
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Figure 3: (a) Some images from the cbcl face database 
(b)The same images after variance normalization, mean 
set to 0.25 and thresholding in the interval [0,1] (c)The 
synthetic statue dataset from the isomap website [14] 
(d)472 images from the orl faces database [T5] 



Table 2: isoNMF, the relative root mean square error, 
sparsity and distance error for the four different datasets 
(cbcl normalized and plain, statue and orl) 

6 Conclusion 

In this paper we presented a deep study of the opti- 
mization problem of NMF, showing some fundamental 
existence theorems for the first time as well as various 
advanced optimization approaches - convex and non- 
convex, global and local. We believe that this study 
has the capability to open doors for further advances 
in NMF-like methods as well as other machine learn- 
ing problems. We also developed and experimentally 
demonstrated a new method, isoNMF, which preserves 
both non-negativity and isometry, simultaneously pro- 
viding two types of interpretability. With the added 
reliability and scalability stemming from an effective op- 
timization algorithm, we believe that this method rep- 
resents a potentially valuable practical new tool for the 
exploratory analysis of common data such as images, 
text, and spectra. 
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